
Conversation

@JenniferWang (Contributor) commented on Sep 24, 2025

Thanks to @casteryh's PR fixing the integration test (#217), we are now able to see the buggy behavior of the manual sharding logic.

This diff showcases why the test failed and how easy it is to introduce silent data-correctness issues if we keep pursuing the manual-sharding route.
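To make "silent" concrete, here is a generic, hypothetical illustration (not the actual bug fixed in this PR) of how a manual tensor-parallel shard of a fused QKV weight can end up with the right shape but the wrong values, so loading succeeds and only the model outputs degrade:

```python
import torch

# Hypothetical fused QKV weight laid out as [q; k; v] along dim 0, to be split
# across tp_size tensor-parallel ranks. Sizes are made up for illustration.
tp_size = 2
hidden = 1024
q_size, k_size, v_size = 1024, 512, 512
full_weight = torch.randn(q_size + k_size + v_size, hidden)

# Wrong but silent: chunk the fused tensor directly. Each rank receives a
# tensor of the expected shape, so loading succeeds, but rank 0 ends up with
# all of Q plus part of K instead of the first half of each of Q, K, and V.
naive_shards = torch.chunk(full_weight, tp_size, dim=0)

# Correct: shard Q, K, and V separately, then re-fuse per rank.
q, k, v = torch.split(full_weight, [q_size, k_size, v_size], dim=0)
correct_shards = [
    torch.cat([t.chunk(tp_size, dim=0)[rank] for t in (q, k, v)], dim=0)
    for rank in range(tp_size)
]

assert naive_shards[0].shape == correct_shards[0].shape      # shapes match...
assert not torch.equal(naive_shards[0], correct_shards[0])   # ...values do not
```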

The plan is to land these two PRs ASAP:

  1. Fix the broken test: integration test for weight sync that actually tests behavior #217
  2. Replace the manual sharding with vanilla load-weight calls: Weight loading working correctly with tp: use vllm builtin load_weights() #184 (a rough sketch of this approach is at the end of this comment)

In the meantime, this PR also fixes the existing Qwen3 (non-MoE) sharding.

Before the fix:

  • test_policy_update integration test fails when tp > 1
  • grpo (Qwen 1.7B) loss is extremely high at the start of training
    (wandb run: https://meta.wandb.io/jiyue/grpo-training/runs/hh3bht2w?nw=nwuserjiyue)

After the fix:

  • test_policy_update integration test passes when tp > 1
  • grpo (Qwen 1.7B) loss looks much more reasonable
    (wandb run: https://meta.wandb.io/jiyue/grpo-training/runs/gnvlw7dc?nw=nwuserjiyue)
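
For item 2 above, a minimal sketch of what replacing the manual sharding with vLLM's builtin weight loading could look like. This is a sketch under assumptions, not this repo's actual weight-sync code: the attribute path to the wrapped model is version-dependent, the helper name is made up, and it assumes an in-process (single worker) engine. The point is that load_weights() consumes full (name, tensor) pairs and lets vLLM's own weight loaders do the per-rank sharding.

```python
from typing import Dict, Iterable, Tuple

import torch
from vllm import LLM


def sync_weights_from_trainer(llm: LLM, state_dict: Dict[str, torch.Tensor]) -> None:
    # Version-dependent accessor (assumption): reach the nn.Module wrapped by the engine.
    model = llm.llm_engine.model_executor.driver_worker.model_runner.model

    # load_weights() takes full, unsharded (name, tensor) pairs; each parallel
    # layer's weight loader narrows the tensor to its own TP shard, so no
    # manual per-rank slicing is needed here.
    weights: Iterable[Tuple[str, torch.Tensor]] = (
        (name, tensor) for name, tensor in state_dict.items()
    )
    model.load_weights(weights)
```

With multi-process tensor parallelism, the same call would need to reach every worker's model rather than just the driver's.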

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Sep 24, 2025
@allenwang28 (Contributor) commented:
@JenniferWang life pro tip: you should be able to run pre-commit install (assuming you've done pip install pre-commit), which will run the linter for you automatically every time you git commit.

@JenniferWang changed the title from "Demonstrate the vLLM sharding bug when trainer export the weights" to "Fix manual vLLM Qwen3 sharding bug when trainer export the weights" on Sep 26, 2025
@JenniferWang merged commit 825596b into meta-pytorch:main on Sep 26, 2025 (5 checks passed)


Successfully merging this pull request may close this issue: Give a better solution to manually calculated vLLM sharding logic in sharding.py
